|
CVE treats any field containing non-numeric
characters as invalid. Therefore, to use dates
within CVE, change them to number format in
Excel before saving the file as a CSV file.
Excel stores a date as a count of days from
1/1/1900; the decimal part represents the hours,
minutes and seconds. If necessary, the decimal
part may be split into these groups.
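The conversion described above can be sketched in Python for readers who prepare data outside Excel. This is a minimal sketch, not part of CVE: the function names are hypothetical, and it assumes Excel's default 1900 date system and dates on or after 1 March 1900.

```python
from datetime import datetime

# Excel's 1900 date system counts 1/1/1900 as day 1 and also counts a
# non-existent 29 February 1900, so 30 December 1899 acts as day 0 for
# any date from 1 March 1900 onwards.
EXCEL_DAY_ZERO = datetime(1899, 12, 30)
SECONDS_PER_DAY = 86400


def to_excel_serial(moment):
    """Convert a datetime to an Excel-style serial number (whole seconds only)."""
    delta = moment - EXCEL_DAY_ZERO
    return delta.days + delta.seconds / SECONDS_PER_DAY


def split_serial(serial):
    """Split a serial number into (days, hours, minutes, seconds)."""
    days = int(serial)
    seconds = round((serial - days) * SECONDS_PER_DAY)
    if seconds == SECONDS_PER_DAY:  # rounding pushed the time to midnight
        days, seconds = days + 1, 0
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return days, hours, minutes, secs


noon_serial = to_excel_serial(datetime(2020, 1, 1, 12, 0, 0))
parts = split_serial(43831.75)
```

Here noon on 1 January 2020 becomes 43831.5, and the serial 43831.75 splits into day 43831 at 18:00:00.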
We recommend that you use the comma-separated
values (CSV) format. CVE also accepts
space-separated and tab-delimited data. There is
also a native data format, identified by the
.dat file extension. The .csv file format is
compatible with all spreadsheets and, most
importantly, preserves missing data, which may
be lost in a space- or tab-delimited file.
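The point about missing data can be illustrated with a short Python sketch. The column names and values are invented for the example, and the whitespace split stands in for a generic space-delimited reader; it is not a description of how CVE itself parses files.

```python
import csv
import io

# The same record, with the middle value missing, in two formats.
csv_text = "temp,pressure,flow\n20.5,,3.1\n"
space_text = "temp pressure flow\n20.5  3.1\n"

# A CSV reader keeps the empty field, so every value stays in its column.
csv_rows = list(csv.reader(io.StringIO(csv_text)))

# Splitting on whitespace collapses the gap, so the flow value silently
# moves into the pressure column.
space_rows = [line.split() for line in space_text.splitlines()]

csv_record = csv_rows[1]      # ['20.5', '', '3.1']
space_record = space_rows[1]  # ['20.5', '3.1']
```

The CSV record still has three fields, one of them empty, while the space-delimited record has only two and no indication of which value was lost.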
How much data should I collect?
Due to the complexity of statistical analysis,
data collection has been a difficult task.
Analysts have had to be selective when starting
new projects to ensure that the data collected
is relevant to the research. CVE has removed
these restrictions, opening the way for much
larger data sets and larger problems to be
analysed. One will never appreciate the true
power of CVE if only investigating a few
variables. That is not to say that CVE is not a
capable or powerful tool for this level of data;
it is simply much better when dealing with large
projects.
- Consider looking at systems rather than the
  subsystem you suspect is causing the problem.
- Including information from variables not
  directly connected to a problem permits one to
  see the wider picture and very rapidly
  identify unexpected variables.
- 30 to 40 variables can easily be plotted and
  many more are easily managed. Including as
  many variables as possible at the start of the
  project will make future analysis work easier
  and quicker.
The permutation
and variable management tools within CVE make it
simple to hide and reorder variables to only
show those that appear useful.
What data should I collect?
The answer to this
question depends largely upon the type of
process and timescale.
For example, consider a gas turbine. The
residence time of fuel gas in the turbine is
very short, just a fraction of a second, but the
run time is very long, up to a year between
shutdowns. To investigate reduced performance
due to blade fouling, one might look at weekly
data for a few years. When investigating NOx
emissions, 30-second data may be more
appropriate. Whatever the period chosen, one
should aim for 1000 to 5000 measurements (rows).
This is more than sufficient to give a solid
impression of the operating characteristic.
Using less data can still produce excellent
results; as few as 100 rows can produce some
startling conclusions. When fewer than 100 rows
are available there is still plenty of
information, but extreme or unusual events can
sometimes be given more weight than they
warrant.
When collecting data, consider the number of
variables you have chosen. When plotting a graph
in two dimensions, just two points will show a
line. Add another point and that line might
appear as a curve; add another and the line
might be straight but with definable error. Add
another and the error becomes clearer. Each time
you add a point, the level of information in the
graph grows, but the significance of each
individual point diminishes. Remember that a
25-variable parallel coordinate plot shows in
one view the equivalent of 300 Cartesian plots
(one for each pair of variables), so more data
points are required.
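The figure of 300 is the number of distinct pairs that can be drawn from 25 variables, n(n - 1)/2. A small Python sketch of that arithmetic follows; the function name is hypothetical.

```python
def pairwise_plot_count(n_variables):
    """Number of distinct two-variable (Cartesian) plots for n variables."""
    return n_variables * (n_variables - 1) // 2


# Pair counts for a few plot sizes, including the 25-variable case.
counts = {n: pairwise_plot_count(n) for n in (5, 10, 25, 40)}
```

For 25 variables this gives 300, and for 40 variables it gives 780, which shows how quickly the equivalent number of Cartesian plots grows.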
Yes, variables may
be added which are a function of other variables
contained within the data set. The variables
used to define a new variable do not need to be
visible in the parallel plot. To create a new
variable, select the expression variable from
the variables menu, change the label and enter
the algebraic expression for the variable
definition.
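The idea of a variable defined as a function of other variables can be sketched outside CVE as follows. This is only an illustration in Python: the column names, values and helper function are hypothetical, and the lambda is not CVE's expression syntax.

```python
# Two rows of invented turbine data.
rows = [
    {"power_mw": 40.0, "fuel_flow_kg_s": 2.5},
    {"power_mw": 42.0, "fuel_flow_kg_s": 3.0},
]


def add_expression_variable(rows, label, expression):
    """Return copies of the rows with one extra column computed from each row."""
    return [{**row, label: expression(row)} for row in rows]


# New variable: power delivered per unit of fuel flow.
extended = add_expression_variable(
    rows,
    "mw_per_kg_s",
    lambda row: row["power_mw"] / row["fuel_flow_kg_s"],
)
```

The original rows are left unchanged, so the source variables can be hidden or kept independently of the derived one.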
CVE also permits
you to add other forms of variable, including
index and clustering.
A saved state is a record of where in the
analysis you have reached and the steps taken to
get there. It is very useful for pausing and
restarting, or for storing a position when there
are multiple analysis options. For more
information, read the tip on saving your current
position.
Bad data, missing values and non-numeric fields
are handled in exactly the same way. By default,
CVE gives each a value 5% below the minimum for
that variable; this makes them easy to identify.
You may select these points to be removed
completely or set them to a different value. |
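The default placement of bad values can be sketched in Python. This is not CVE's implementation: the function names are hypothetical, and the sketch assumes that "5% below the minimum" means 5% of the variable's range (maximum minus minimum) below its minimum.

```python
import math


def to_number(field):
    """Return the field as a finite float, or None if it is missing or invalid."""
    try:
        value = float(field)
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) else None


def fill_bad_values(fields, fraction=0.05):
    """Replace bad fields with a placeholder just below the valid minimum.

    Assumption: the offset is `fraction` of the range of the valid values.
    """
    numbers = [to_number(field) for field in fields]
    valid = [n for n in numbers if n is not None]
    if not valid:
        raise ValueError("no valid numeric values in this variable")
    low, high = min(valid), max(valid)
    placeholder = low - fraction * (high - low)
    return [placeholder if n is None else n for n in numbers]


filled = fill_bad_values(["10", "", "n/a", "20"])
```

With valid values running from 10 to 20, the empty and non-numeric fields are placed at 9.5 under this assumption, just below every genuine point.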